We model certain features of human language complexity by means of advanced
concepts borrowed from statistical mechanics. Using a time series approach, the
diffusion entropy (DE) method, we compute the complexity of an Italian corpus of
newspapers and magazines. We find that the anomalous scaling index is compatible
with a simple dynamical model, a random walk on a complex scale-free network,
which is linguistically related to Saussure's paradigms. The model yields the
famous Zipf's law in terms of the generalized central limit theorem.



1 Introduction 

Semiotics studies linguistic signs and their meanings, and identifies the relations
between signs and meanings, and among signs. The relations among signs (letters,
words) are divided into two large groups, namely the syntagmatic and the
paradigmatic, corresponding to what are called Saussure's dimensions. These
dimensions are analogous to physical concepts like time and space. One can grasp an
understanding of them by looking at Fig. 1. The abscissa represents the
syntagmatic dimension, while the ordinate represents the paradigmatic one.
Along the abscissa, grammatical rules constrain how words follow each
other. This dimension is a temporal one, with a causal order. An article (such as "a"
or "the"), e.g., may be followed by an adjective or a noun, but not by a verb of
finite form. At a larger "time-scale", pragmatic constraints rule the succession of
concepts, to give logic to the discourse. The other axis, on the other hand, refers to
a "mental" space. The speaker has in mind a repertoire of words, divided into many
categories, which can be hierarchically complex and refer to syntactic or semantic
"interchangeability". Different space-scales of word paradigms can be associated with
different levels of this hierarchy. After an article, to follow the preceding example,
one can choose, at a syntactic level, among all nouns of a dictionary. However, at
a deeper level, semantic constraints reduce the words available to be chosen. For
instance, after "a dog" one can choose any verb, but in practice only among verbs
selected by semantic constraints (a dog runs or sits, but does not read or smoke).
The sentence "a dog graduates", for instance, fits the paradigmatic and syntagmatic
rules behind Fig. 1, but the semantics would in general forbid the production of
such a "nonsensical" sentence.

The two dimensions are therefore not quite orthogonal, and connect, e.g., at a 
cognitive level. The main focus of this paper is to show that this connection is in 
fact reproduced at all scales. We shall also show that both dimensions are scale free 
and that the complexity of linguistic structures in both dimensions can be taken 
into account in a unified model, which is able to explain most statistical features 
of human language, including, at the largest scales, the celebrated Zipf's law.




Figure 1. Saussure's dimensions (abscissa: syntagmatic axis; ordinate: paradigmatic axis). In this
example the first position on the syntagmatic axis is an article, the second a noun and the third a
verb in the third person.



Zipf's law relates the rank r of words to their frequency f in a corpus. Remarkably,
this does not mean that the probability of a word is actually defined. In fact,
a word may have a small or large frequency depending on the genre of the corpus
(i.e. a large collection of written text) under study, and even two extremely large
corpora of the same type fail to reproduce the same word frequencies. It is
nevertheless remarkable that, for any corpus and for any natural language, one
finds only a few frequent words and a large number of words encountered once or
twice. Let us define the word rank r, a property depending on the corpus adopted,
as follows. One assigns rank 1 to the most frequent word, rank 2 to the second most
frequent one, and so on. Each word is uniquely associated with a rank and, although
this number varies from corpus to corpus, one always finds that

$$ f \approx \frac{1}{r} \tag{1} $$

This property means that word frequencies do not tend to well-defined probabilities.
We assume that what can be defined is a "probability of having a frequency" P(f)
for a randomly selected word. Operationally, one measures P(f) by counting how
many words have a certain frequency f. In Section 3 we show that a P(f)
compatible with Zipf's law can be derived from the model proposed herein, thus
providing experimental support for our model. For our purposes, we assume
statistical mutual independence for the occurrence of different concepts. This hypothesis
is appealing, since it means that each occurrence of a concept makes
entropy increase, thus identifying the mathematical information (i.e. entropy) with
the common-sense information (i.e. the occurrence of concepts). Unfortunately, a
concept is not, a priori, a well-defined quantity. Herein we assume that concepts are
represented by words (or better by lemmata) or by groups of semantically similar
words or lemmata^a.
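The rank assignment and the count of "how many words have a certain frequency" described above can be sketched in a few lines. This is only an illustrative sketch, not the procedure actually used on the corpus: the function names `rank_frequency` and `frequency_of_frequencies` and the toy token list are our own assumptions, and ties in frequency are broken alphabetically so that each word receives a unique rank.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, word, count)], with rank 1 for the most frequent word."""
    counts = Counter(tokens)
    # Sort by decreasing count; ties are broken alphabetically (an assumption)
    # so that every word is associated with exactly one rank.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def frequency_of_frequencies(tokens):
    """Return {f: number of distinct words that occur exactly f times}."""
    counts = Counter(tokens)
    return dict(Counter(counts.values()))


# Toy example (hypothetical data, far too small to show Zipf's law).
tokens = "the dog and the cat and the bird".split()
ranks = rank_frequency(tokens)
p_of_f = frequency_of_frequencies(tokens)
```

Dividing the values of `p_of_f` by the number of distinct words would give a normalized estimate of P(f); on a real corpus one would first lemmatize the tokens.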

Because of the mutual independence among different concepts, we can extract 
from a single corpus as many "experiments" as the number of concepts. For each 
experiment we select only one concept and we mark the occurrence of the selected 
word or group of words corresponding to this concept. For the analysis we use 
the recently developed Diffusion Entropy (DE) method, which is able to identify 
whether a marker is a "real event", i.e. it carries maximal information, and to
extract the scaling properties of the language dynamics. We show here that anomalous
scaling (different from Brownian motion) is an indication of long-range correlations 
of the series, and that in fact these properties are well measured by the DE, even 
if the marker is not identified with absolute precision. 

The overall dynamics, given by the flow of concepts over time, experimentally
mirrors the dynamics of intermittent dynamical systems such as the Manneville map:
these systems have long periods of quiescence followed by bursts of activity. This
variability of waiting times between markers of activity is responsible for long-range
correlations.

The second aspect of the paper is the connection between space and time
complexity, and its application to linguistics. We will assume that atomic concepts
exist and represent nodes of a complex network, connected by arcs representing,
when existing, semantic associations between one concept and another. We assume
that our markers are actually defined as groups of neighboring nodes. We then
assume that language can be produced by a random walker, "associatively" traveling
from concept to concept. The scale-free properties of the network, independently
measured by our research group, provide a bridge to understanding the intermittent
dynamics described earlier. In this unified model the network is a representation
of Saussure's paradigms, whose complexity mirrors the syntagmatic one in the
asymptotic limit.



2 DE and concepts 



Let us review the DE method. In short, one defines a "marker" on a
time sequence and studies the probability p(x; t) of having a number x of markers
in a window of length t. This statistical analysis is done by moving a window of
length t along the sequence, counting how many times one finds x markers inside
this window, and dividing this number by the total number N - t + 1 of windows
of size t, where N is the total length.
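The moving-window estimate of p(x; t) described above can be sketched as follows. This is a minimal sketch under our own naming: `window_count_distribution` is a hypothetical helper, and the 0/1 series is a toy example rather than corpus data.

```python
from collections import Counter


def window_count_distribution(series, t):
    """Estimate p(x; t) as the fraction of the N - t + 1 windows of length t
    that contain exactly x markers (entries equal to 1)."""
    n = len(series)
    if not 1 <= t <= n:
        raise ValueError("window length t must satisfy 1 <= t <= len(series)")
    x = sum(series[:t])              # number of markers in the first window
    histogram = Counter([x])
    for start in range(1, n - t + 1):
        # Slide the window by one position: add the entering symbol and
        # subtract the leaving one.
        x += series[start + t - 1] - series[start - 1]
        histogram[x] += 1
    total = n - t + 1
    return {count: hits / total for count, hits in sorted(histogram.items())}


# Toy series: 1 marks the presence of the marker, 0 its absence.
series = [0, 1, 0, 0, 1, 0, 1]
p = window_count_distribution(series, 3)
```

The sliding update keeps the cost linear in N for each window length t, which matters when t is scanned over several decades.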



"A lemma is defined as a representative word of a class of words, having different morphological 
features. For instance the word "dogs" has lemma "dog", and word "sleeping" has lemma "sleep". 
Having large values of x and t, we can adopt a continuous
approximation. Moreover, in the ergodic and stationary condition, a scaling relation is
expected, namely

$$ p(x;t) = \frac{1}{t^{\delta}} F\!\left(\frac{x - wt}{t^{\delta}}\right) \tag{2} $$

where w is the overall marker density, $\delta$ is the scaling index and F is a function. If
F is the Gauss function, $\delta$ is the known Hurst index, and if the further condition
$\delta = 0.5$ is obeyed, then the process is said to be Poissonian, and the dynamics of
x is called "Brownian motion". If this condition applies, there is no long-range
memory regulating the occurrence of markers in time.

It is straightforward to show that $S(t) = -\int_{-\infty}^{\infty} dx\, p(x;t) \ln p(x;t)$, namely the
Shannon information, together with condition (2) leads to

$$ S(t) = k + \delta \ln t, \tag{3} $$

where k is a constant. The slope with which S increases
with ln t therefore provides a measure of the anomalous scaling index $\delta$.
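For completeness, the step from the scaling relation to the logarithmic growth of the entropy can be written out explicitly. The sketch below assumes that the scaling form is $p(x;t) = t^{-\delta} F((x - wt)/t^{\delta})$ and that F is normalized to one; the substitution variable y is introduced here only for illustration.

```latex
\begin{aligned}
S(t) &= -\int_{-\infty}^{\infty} dx\, p(x;t)\,\ln p(x;t),
\qquad p(x;t) = \frac{1}{t^{\delta}}\,F\!\left(\frac{x - wt}{t^{\delta}}\right) \\
     &= -\int_{-\infty}^{\infty} dy\, F(y)\,\bigl[\ln F(y) - \delta \ln t\bigr],
\qquad y = \frac{x - wt}{t^{\delta}},\quad dx = t^{\delta}\,dy \\
     &= \underbrace{-\int_{-\infty}^{\infty} dy\, F(y)\,\ln F(y)}_{k}
        \;+\; \delta \ln t \int_{-\infty}^{\infty} dy\, F(y)
      \;=\; k + \delta \ln t .
\end{aligned}
```

The constant k is thus the entropy of the scaling function F itself, and it does not depend on t.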

Let us briefly mention what is known about applying DE to time series with
known long-range correlation. We construct an artificial series by letting $\xi_i = 1$
(this means that we find the marker at the i-th position), or $\xi_i = 0$ (the i-th sign is
not a marker). We then assume "informativity" for the marker (markers are then
called "events"), namely that the distance between a "1" and the next one does
not depend on the previous distances. Then, if the distances t between events
are distributed as

$$ \psi(t) = (\mu - 1)\, \frac{T^{\mu - 1}}{(T + t)^{\mu}} \tag{4} $$

($\psi(t) \sim t^{-\mu}$ asymptotically is a sufficient condition), then the theory based on
the continuous-time random walk and on the generalized central limit theorem yields
for p(x; t) a truncated Lévy probability distribution function (PDF). DE detects
the scaling $\delta$ of the central part, namely

$$ \delta = \frac{1}{\mu - 1} \ \text{ if } 2 < \mu < 3, \qquad \delta = 0.5 \ \text{ if } \mu > 3. \tag{5} $$
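An artificial event series of the kind discussed above can be generated with a short script. The sketch below is only illustrative and rests on explicit assumptions: we take the waiting-time density to be $\psi(t) = (\mu - 1) T^{\mu-1}/(T + t)^{\mu}$, whose survival probability is $(T/(T+t))^{\mu-1}$, sample it by inverse transform, and round each waiting time up to a whole number of symbols. The names `sample_waiting_time`, `event_series` and `expected_delta` are hypothetical.

```python
import random


def sample_waiting_time(mu, T, rng):
    """Draw an integer waiting time >= 1 by inverting the assumed survival
    probability (T / (T + t)) ** (mu - 1)."""
    u = 1.0 - rng.random()                       # uniform in (0, 1]
    t = T * (u ** (-1.0 / (mu - 1.0)) - 1.0)     # continuous waiting time
    return int(t) + 1                            # discretization (assumption)


def event_series(n_events, mu, T=1.0, seed=0):
    """Build a 0/1 series where the gaps between successive 1s are
    independent draws from the assumed power-law distribution."""
    rng = random.Random(seed)
    series = []
    for _ in range(n_events):
        gap = sample_waiting_time(mu, T, rng)
        series.extend([0] * (gap - 1) + [1])
    return series


def expected_delta(mu):
    """Scaling index predicted for uncorrelated waiting times:
    1 / (mu - 1) for 2 < mu < 3, and 0.5 above (mu = 3 is treated as 0.5)."""
    if mu <= 2.0:
        raise ValueError("the prediction is stated only for mu > 2")
    return 1.0 / (mu - 1.0) if mu < 3.0 else 0.5


series = event_series(n_events=100, mu=2.5, T=1.0, seed=1)
```

Because the gaps are drawn independently, every 1 in such a series is an "event" in the sense used here, so the series can serve as a reference case when testing a DE implementation.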

The condition $2 < \mu < 3$ means long-range correlation, since for truncated Lévy
PDFs asymptotically $\langle x^2(t) \rangle - \langle x(t) \rangle^2 \propto t^{4-\mu}$ and therefore the correlation function
decays as $t^{2-\mu}$. Note that the decay of this correlation function is non-integrable,
yielding an infinite correlation time. The theory rests on a dichotomous variable $\xi$, and
experimentally this means the presence or absence of a certain marker. One may,
for instance, look for a certain letter, so that the time is the ordinal number of the
typographical characters in the text. As shown later, we obtain better results by
looking at lemmata, where the "time" is the ordinal number of words. We shall
show that, with a good choice of semantic markers, Eq. (4) is a good model for
concept dynamics in natural language.

Eq. (5) rests on uncorrelated waiting times between events. This means that
if markers are separated by intervals of words of duration $\tau_k$ (the distance in
words between the k-th and the (k+1)-th occurrence of the marker) then $\langle \tau_i \tau_j \rangle \propto \delta_{ij}$,
where $\delta_{ij}$ is the Kronecker delta. Under these conditions each event carries the
same amount of information. The statistical independence between the $\tau_k$ intervals
means that the information carried by the events is maximal for a given waiting
time distribution $\psi(t)$. In linguistic jargon, we can say that if in a corpus we
find a marker (e.g. a list of words) such that $\delta \approx 1/(\mu - 1)$ then this marker
is informative in that corpus. For didactic purposes, we shall see that certain
markers, e.g. punctuation marks, are not real events, but are rather modeled by
a Copying Mistake Map (CMM). This means that the complex dynamics of discourse
is such that the punctuation marks actually carry long-range correlations, and
anomalous scaling in the PDF, while the waiting times between such marks are
correlated. Punctuation marks are not informative. Their complexity is just a
projection of a complexity carried by "concept dynamics".



2.1 The CMM and non-informative markers 

The Copying Mistake Map (CMM) is a model originally introduced to study
the anomalous statistics of nucleotide dispersion in coding and non-coding DNA
regions. The CMM is a combination of two sequences. We have an "original" time
sequence, e.g. the long-range-correlated series discussed earlier, corresponding
to the waiting time distribution of Eq. (4). Then, for any $\xi_i$, we either leave it unchanged
with probability $\epsilon$ or replace it with a completely random value with probability
$1 - \epsilon$ (copying mistake).
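The copying step described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `copying_mistake_map` is hypothetical, and we assume that a "completely random value" means 0 or 1 with equal probability.

```python
import random


def copying_mistake_map(original, epsilon, seed=0):
    """Copy a 0/1 sequence symbol by symbol. Each symbol is kept unchanged
    with probability epsilon; otherwise it is replaced by a random symbol
    (assumption: 0 or 1 with equal probability)."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    rng = random.Random(seed)
    copied = []
    for symbol in original:
        if rng.random() < epsilon:
            copied.append(symbol)             # faithful copy
        else:
            copied.append(rng.randint(0, 1))  # copying mistake
    return copied


original = [0, 1, 0, 0, 1, 1, 0, 1] * 10
faithful = copying_mistake_map(original, epsilon=1.0)
noisy = copying_mistake_map(original, epsilon=0.0)
```

With `epsilon=1.0` the original sequence is returned unchanged, while with `epsilon=0.0` every symbol is random and no trace of the original correlations survives; intermediate values mix the two behaviors.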

The resulting waiting time distribution decays exponentially, since the probability
of finding a 1 at a time t after the preceding one is given by two terms. This
is because the 1 can have two kinds of origin: it may be an "original"
1, or an original "zero" flipped by the copying mistake. We can write the "experimental"
waiting-time distribution $\psi_{exp}(t)$ in terms of $\psi_{corr}(t)$ of the mentioned
long-range-correlated model and of $\psi_{rand}(t)$ of the Poissonian copying process,
namely

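$$ \psi_{exp}(t) = \psi_{corr}(t)\, \Psi_{rand}(t) + \psi_{rand}(t)\, \Psi_{corr}(t), \tag{6} $$

where $\Psi_{corr}(t) = \int_t^{\infty} dt'\, \psi_{corr}(t')$ and $\Psi_{rand}(t) = \int_t^{\infty} dt'\, \psi_{rand}(t')$, and

$$ \psi_{rand}(t) = \ln\!\left(\frac{2}{1+\epsilon}\right) \cdot \left(\frac{1+\epsilon}{2}\right)^{t}. \tag{7} $$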

Since $\psi_{rand}(t)$ and consequently $\Psi_{rand}(t)$ decay exponentially, so
does, in the asymptotic limit, $\psi_{exp}(t)$. What about the DE curve? The theory
predicts a random ($\delta = 0.5$) behavior for short times, a knee, and a slow transition
to the totally correlated behavior. An example of CMM is given by punctuation
marks in natural language. We choose punctuation marks as markers for an
Italian corpus of newspapers and magazines, more than 300,000 words long,
called the Italian Treebank (hereafter TB). In this experiment we look at words, and
we put a 0 for every word which is not a punctuation mark, and a 1 when we find
such a mark (full stops, commas, etc.). A sentence like "Felix, the cat, sleeps!" is
therefore transformed into "0100101".
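The encoding of a sentence into a 0/1 string of punctuation markers can be sketched as below. The tokenizer is a deliberately simple assumption of ours (runs of word characters are words, every other non-space character is a punctuation mark) and is not the tokenization used for the Treebank; in particular it would treat apostrophes as punctuation.

```python
import re

# Assumption: a token is either a run of word characters or a single
# non-space, non-word character (treated as a punctuation mark).
TOKEN = re.compile(r"\w+|[^\w\s]")


def punctuation_markers(text):
    """Return a string with '0' for each word and '1' for each punctuation mark."""
    bits = []
    for token in TOKEN.findall(text):
        is_word = token[0].isalnum() or token[0] == "_"
        bits.append("0" if is_word else "1")
    return "".join(bits)


encoded = punctuation_markers("Felix, the cat, sleeps!")
```

For the example sentence of the text, `encoded` is the string "0100101".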

Fig. 2 shows that these markers lead to a time series with all the features of a CMM
described earlier. This means that the waiting times are correlated, and
therefore punctuation marks are not events. Notice, however, that an asymptotic
anomalous $\delta$ is detected by DE, and therefore there is a long-range correlation in
the text, which may be carried by some other, more informative marker.



2.2 Concepts as informative markers 

More experiments, not reported here, show that the CMM behavior is typical of
many characters, and is shared by all the letters of the alphabet, with $\delta \approx 0.6$.
Passing from a "phonetic" level (in Italian we can assume that alphabetic characters
mirror the phonetics) to a morpho-syntactic level is linguistically interesting. To do
so, a text has to be lemmatized and tagged with respect to its parts of speech. After
this procedure we can identify as a marker the occurrence of a certain part of speech
(e.g. article, adverb, adjective, verb, noun, preposition, numeral, punctuation etc.).
For instance, the sentence "Felix, the cat, sleeps!" is now transformed into "N P
R N P V P", where N, P, R and V stand for noun, punctuation, article and verb.
If we select the occurrence of a verb as a marker, then we have "0 0 0 0 0 1 0".
Fig. 3 shows the result of this experiment for verbs and for numerals. We notice
that we have a similar behavior for the DE, and a completely different behavior
for the waiting-time distribution $\psi(t)$ (where t is the number
of words between markers). We notice that DE reveals a long-time correlation,
while $\psi(t)$ shows an exponential truncation at long times. However, in the case
of numerals we find a long transient with a slope $\mu_{numerals} < 2$, and therefore a
non-stationary behavior. This is in fact due to the uneven distribution of numerals
in the corpus, since they are encountered more often in the economic part of the
Italian newspapers. However, this still unsatisfactory result for numerals reveals
that this kind of marker is more informative than a phonetic one or than the
presence of verbs. This is linguistically interesting, since numerals denote a part of
speech, but also a "semantic class".

Figure 2. (a) Diffusion entropy for punctuation marks. The fit for the asymptotic limit (solid
line) yields $\delta = 0.7$. The dashed line marks a transient regime with $\delta = 1$. (b) Non-normalized
distribution of waiting times for punctuation marks, namely counts of waiting times of length t
between marks in TB. The expression for the dashed line fit is $7000 \cdot \exp(-t/7.15)$. (c) Second
moment analysis for punctuation marks. The expression for the solid line fit is $0.06 \cdot t^{1.57}$. Notice
that $1.57 \approx 3 - 1/0.7$, namely the expression $H = 3 - 1/\delta$ of Ref. 9 for Lévy processes stemming
from CMMs is verified.

Figure 3. (a) DE for verbs (squares) and numerals (circles). The dashed line is a fit for the verbs,
with $\delta = 0.73$, while the solid line is a fit for the numerals, with $\delta = 0.82$. (b) $\psi(t)$ for verbs (black
circles) and numerals (white circles). The dashed line is a fit for the numerals, with $\mu = 1.2$, while
the solid line is a fit for the verbs, with $\mu = 3.8$.

We are therefore led to suppose that informative markers are the ones associated
with a semantically coherent class of words. This is however a problem, since
every single concept is too rare in a balanced corpus (a long text with a variety
of genres). The next level of our exploratory search for events is therefore to look
at the occurrence of "salient words" in a specialized text. Such a corpus has been
made available as the Italian corpus of the European project POESIA.
POESIA is a European Union funded project whose aim is to protect children from
offensive web content, e.g. pornography in WWW URLs. Salient "pornographic"
words were automatically extracted by comparing their frequency in an
offensive corpus with that in the balanced TB corpus. The definition adopted
was

$$ s(l) = \frac{f_{EC}(l) - f_{TB}(l)}{f_{EC}(l) + f_{TB}(l)}, \tag{8} $$

where $f_{EC}(l)$ is the frequency, in the erotic corpus, of the lemma l, and $f_{TB}(l)$
is the same property in the reference Italian corpus (Italian Treebank). Salient
lemmata were automatically chosen as the 5% with the highest value of s. Notice
that in this experiment "dirty" words are not taken into consideration, because
they do not appear in the reference corpus, and therefore s cannot be properly
defined. However, an offensive metaphoric use of terms is in fact detected, leading
to a completely new approach to automatic text categorization and filtering, using a
method, based on DE analysis, called CASSANDRA.
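The selection of salient lemmata can be sketched as follows, taking the salience of a lemma to be the difference of its frequencies in the specialized and reference corpora divided by their sum. This is a hedged sketch: the names `salience` and `salient_lemmata`, the dictionary inputs and the toy frequencies are our own assumptions, and, following the text, lemmata absent from the reference corpus are simply skipped.

```python
def salience(f_ec, f_tb):
    """Salience score: (f_ec - f_tb) / (f_ec + f_tb), with f_ec the frequency
    in the specialized corpus and f_tb the frequency in the reference corpus."""
    return (f_ec - f_tb) / (f_ec + f_tb)


def salient_lemmata(freq_ec, freq_tb, top_fraction=0.05):
    """Return the top fraction of lemmata ranked by decreasing salience.
    Lemmata that do not appear in the reference corpus are not scored."""
    scored = {
        lemma: salience(f, freq_tb[lemma])
        for lemma, f in freq_ec.items()
        if freq_tb.get(lemma, 0) > 0
    }
    # Ties are broken alphabetically (an assumption) for reproducibility.
    ranked = sorted(scored, key=lambda lemma: (-scored[lemma], lemma))
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


# Hypothetical relative frequencies for four lemmata.
freq_ec = {"a": 0.3, "b": 0.1, "c": 0.05, "d": 0.2}
freq_tb = {"a": 0.1, "b": 0.1, "c": 0.2}
selected = salient_lemmata(freq_ec, freq_tb, top_fraction=0.5)
```

The score ranges from -1 (lemma typical of the reference corpus) to +1 (lemma typical of the specialized corpus), so a fixed top fraction selects the lemmata most characteristic of the genre.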

Salient words were therefore used as markers for our analysis, as described
earlier. The results are shown in Fig. 4, which clearly shows that in a specialized
corpus the salient words of this genre pass the test of informativeness. Salient words,
and plausibly words in general, are therefore distributed like markers generated
by an intermittent dynamical model, with $\mu \approx 2.1$ and, in agreement with Eq. (5),
$\delta \approx 1/(\mu - 1) = 0.91$. We see in the next section how this behavior is plausibly
connected with a topological complexity at the paradigmatic level, and in Section
3 we derive Zipf's law from the resulting model.



3 Scale-free networks, intermittency and the Zipf's law

In this section we build a cognitive model for connecting structure and dynamics.
Allegrini et al. identified semantic classes in the Italian corpus by looking at
paradigmatic properties of interchangeability of classes of verbs with respect to
classes of nouns. They defined "superclasses" of verbs and nouns as "substitutability
islands", namely groups of nouns and verbs sharing the property that in the
corpus one finds each verb of the class co-occurring, in a context, with each noun of
the class. This is precisely a direct application of the notion of "paradigm". Let
us call $p_V(c)$ and $p_N(c)$, respectively, the number of verbs or nouns belonging to a
number c of classes. They found that

where $\gamma_V$ is a number whose absolute value is (much) smaller than 1.

Figure 4. (a) DE for salient "erotic" words for a corpus of erotic stories and offensive web pages
(squares), and for the Italian reference corpus (circles). The solid line is a fit with expression
$S(t) = k + \delta \ln(t + t_0)$, where the additional parameter $t_0$ is added to the original Eq. (3) to take
transients into account and to improve the quality of the fit, yielding $\delta = 0.91$. (b) Non-normalized
waiting time distribution for salient "erotic" words for a corpus of erotic stories. The expression
for the dashed line fit is $14000 \cdot (12.0 + t)^{-2.1}$, yielding $\mu = 2.1$.

Along the same lines, other authors found a "small world" topology by
looking at the number of synonyms in an English thesaurus for each English lemma.
We can therefore assume that this kind of structure is general for any language.
Let us therefore imagine that the paradigmatic structure of concepts is a scale-free
network and consider a random walk in this "cognitive space". Let us make the
following assumptions:

1) The statistical weight of the i-th node is $\omega_i \sim c_i$;

2) Ergodicity, and therefore that the characteristic recurrence time is $\tau_i \sim c_i^{-1}$;

3) The same form for all nodes, $\psi_i(t) = (1/\tau_i) F(t/\tau_i)$ (e.g. $F(x) = \exp(-x)$).

Now we imagine that selecting a concept means selecting a few neighboring nodes.
This collection of nodes, due to the scale-free hypothesis, shares the same scaling
properties as the complete scale-free network, namely $p(c) \sim c^{-\nu}$. Therefore we
have that
have that 

$$ \psi(t) \sim \int dc\, p(c)\, c\, e^{-ct} \sim \frac{1}{t^{\mu}}. \tag{10} $$

We have thus recovered the intermittent model of Eq. (4).

Let us now derive Zipf's law, $f \propto r^{-a}$, with a close
to unity. Let us define a probability of frequency P(f):

$$ P(f)\,df = \mathrm{prob}(r)\,dr \;\Rightarrow\; P(f) \sim f^{-(a+1)/a}. \tag{11} $$
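The change of variables from rank to frequency can be made explicit. The short derivation below is a sketch under one stated assumption, namely that a randomly selected word has a rank drawn with uniform weight, so that prob(r) is constant; proportionality constants are dropped.

```latex
\begin{aligned}
f \propto r^{-a} \;&\Longrightarrow\; r \propto f^{-1/a}, \\
P(f) = \mathrm{prob}(r)\left|\frac{dr}{df}\right|
  \;&\propto\; \frac{1}{a}\, f^{-1/a - 1} \;=\; \frac{1}{a}\, f^{-(a+1)/a}, \\
a = 1 \;&\Longrightarrow\; P(f) \propto f^{-2}.
\end{aligned}
```

A Zipf exponent close to unity therefore corresponds to a probability of frequency whose tail decays approximately as the inverse square of f.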



Next, let us notice that P(f) must be a stable distribution. In fact, Zipf's law
is valid for every corpus. In particular, if it is valid for corpus A and for corpus B,
it is also valid for the corpus A + B, where + means the concatenation of corpora.
If we continue concatenating, we obtain the total corpus of Eq. (12),
and the frequency of a word in the total corpus, $f_{tot}$, is written in terms of
the single frequencies $f_1, f_2, \ldots$ and total lengths $N_1, N_2, \ldots$ of the single corpora,



i.e., the Generalized Central Limit Theorem applies. This means that the probability
of frequency P(f) is a Lévy $\alpha$-stable distribution. This probability of finding
f occurrences of a word in a corpus of a given length can be identified with p(x; t)
of Section 2, if we take into consideration the parameter t. We have noticed earlier
that p(x; t) in language is a Lévy process, with $\delta \sim 1$, and therefore with a tail
$P(f) \sim f^{-2}$. In other words, through Eq. (11) we recover $a \simeq 1$, i.e. Zipf's law.

4 Conclusions 

In this paper we have shown that a cognitive process governing human language
may be identified, and that it has a complexity along both the syntagmatic and
the paradigmatic axes. The scaling properties of the two axes are related to each
other, and are reflected by the celebrated Zipf's law. This study was conducted
using Italian written corpora, but decades of studies on the generality of Zipf's
law lead us to suppose that our results are language independent, and that the
language complexity that we are revealing is genuine and important. In fact, for
any concept, we have a scaling index associated with an intermittent dynamical
model that rests at the border between ergodicity and non-ergodicity, since
Zipf's law is theoretically consistent with $\delta = 1$. Moreover, in a specialized text
we see a tendency to drift, for salient words, towards ergodicity ($\delta \approx 0.91$ in the
reported experiment). This behavior can be interpreted as the balance between
two opposite needs of human language, namely learnability, i.e. the possibility
for a child to learn a language by examples, and variability, to explore an infinite
cognitive space.

As future work, we propose to study language complexity in children during their
learning years, and in psychopathological subjects. We imagine, if the theory
presented herein is validated by more extensive work, that the simple study of
individual Zipf's laws can provide a reasonable non-invasive diagnostic method for
certain mental diseases.

From a Language Engineering point of view, this study provides a theoretical
background for a completely new strategy of automatic text categorization. A prototype
is being implemented as a semantic filter. We think that the proposed
test for informativeness for a set of markers can also be important for many
exploratory studies in time series analysis. For instance, it may become important to
identify crucial semantic markers in a flow of data.

$$ \text{Total Corpus} = \text{Corpus A} + \text{Corpus B} + \ldots \tag{12} $$